GitHub

The code for the herein described process can also be freely downloaded from https://github.com/fzenoni/london-accidents.

Baby, you can drive my car

Besides being able to “put dots on a map”, R can be used in very effective ways to do spatial analytics. Last month, Stefano already provided descriptive statistics on a Kaggle dataset that includes 16 years of UK accidents.

It is now time to move up a gear (no pun intended), and tackle one of the many ways of extracting spatial insights from the data. In this post, we are going to ignore the potential analysis brought by the time series. It will certainly be the topic of another post.

Instead we will try to propose an answer to the following questions: what if the government wanted to highlight the areas of a city showing some alarming characteristics, with a given statistical significance? What if we wanted to discover what are London’s most dangerous roads or intersections for car drivers?

In fact, London happens too large for our pruposes. I will investigate the area enclosed by a radius of 4 km, but the method shown in the following stays valid at any scale.

“80% of the data analyst job”

First things first, we load the data and clean it a bit. The fastest way to do it in-memory, while enjoying the functions devoted to tables, is still to use the data.table package, together with the selection of the strict minimum amount of columns.

# Load data
set <- list.files(path = '1-6m-accidents-traffic-flow-over-16-years',
                  pattern = 'accidents.*csv', full.names = TRUE)
cols_to_keep <- c('Location_Easting_OSGR', 'Location_Northing_OSGR', 'Number_of_Casualties')
data <- lapply(set, fread, select = cols_to_keep) %>% rbindlist
# Filter out empty locations
data <- na.omit(data, cols = cols_to_keep)
# Remove duplicates
data <- data[!duplicated(data)]

data.table is nice and all, but since we work with spatial data we’re going to use the sf format, as we did in the past. As sf does not exactly extend data.table, I’m going to cast the table to a data.frame first. Note the CRS, corresponding to the British Ordnance Survey National Grid.

data <- st_as_sf(data.frame(data), coords = c('Location_Easting_OSGR', 'Location_Northing_OSGR'), crs = 27700)

This data include accidents over 16 years for the whole of UK. It represents a lot of records, and performance wise, I don’t necessarily have a strategy in place to analyze them all. As a first move, I’m going to select the events that fall inside the boroughs of London administrative boundaries. The Kaggle dataset luckily includes the geoJSON of UK’s districts.

Until very recently, I’ve had mixed feelings in the past concerning this format. I changed my mind thanks to the discovery of two methods to open and import it as sf.

The first one is the now classic sf::read_sf().

system.time(sf::read_sf('1-6m-accidents-traffic-flow-over-16-years/Local_Authority_Districts_Dec_2016.geojson'))
##    user  system elapsed 
##   8.794   0.721   9.918

But the freshest discovery is the geojsonsf::geojson_sf() function from SymbolixAU (check their blog post post here), that serves exactly our purpose, in an even faster way.

system.time(map <- geojsonsf::geojson_sf('1-6m-accidents-traffic-flow-over-16-years/Local_Authority_Districts_Dec_2016.geojson'))
##    user  system elapsed 
##   2.569   0.151   3.454

Let’s quickly inspect the map object.

head(map)
## Simple feature collection with 6 features and 10 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -2.832457 ymin: 53.30503 xmax: -0.7884304 ymax: 54.72717
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##    bng_e  bng_n                       geometry   lad16cd
## 1 447157 531476 MULTIPOLYGON (((-1.268456 5... E06000001
## 2 451141 516887 MULTIPOLYGON (((-1.243904 5... E06000002
## 3 464359 519597 MULTIPOLYGON (((-1.137578 5... E06000003
## 4 444937 518183 MULTIPOLYGON (((-1.317286 5... E06000004
## 5 428029 515649 POLYGON ((-1.637678 54.6171... E06000005
## 6 354246 382146 MULTIPOLYGON (((-2.626835 5... E06000006
##                lad16nm lad16nmw      lat     long objectid st_areashape
## 1           Hartlepool          54.67616 -1.27023        1     93559511
## 2        Middlesbrough          54.54467 -1.21099        2     53888581
## 3 Redcar and Cleveland          54.56752 -1.00611        3    244820281
## 4     Stockton-on-Tees          54.55691 -1.30669        4    204962233
## 5           Darlington          54.53535 -1.56835        5    197475689
## 6               Halton          53.33424 -2.68853        6     79084033
##   st_lengthshape
## 1       71707.33
## 2       43840.85
## 3       97993.29
## 4      119581.51
## 5      107206.28
## 6       77770.94

Now I would like to extract London’s borough, but I am bored by having to inspect the map, and the need to learn new geography. I am the laziest member of the team, and I refuse to copy-paste 31 names into a list. Therefore, I decided to harvest this Wikipedia page with the rvest package, and use the list to filter the regions of interest.

# Harvest the list of London boroughs from Wikipedia
wiki_london <- xml2::read_html('https://en.wikipedia.org/wiki/London_boroughs')
boroughs1 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[1]] %>% rvest::html_text()
boroughs2 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[2]] %>% rvest::html_text()

list1 <- as.list(strsplit(boroughs1, "\n")) %>% unlist
list2 <- as.list(strsplit(boroughs2, "\n")) %>% unlist
list <- c(list1, list2)

# Special cases to fix
list <- replace(list, list=='City of London (not a London borough)', 'City of London')
list <- replace(list, list=='City of Westminster', 'Westminster')

# Filter map
london <- map %>% dplyr::filter(lad16nm %in% list)
# Unite the boroughs
london_union <- london %>% sf::st_union()

The London map is ready to be displayed.

plot(london_union)

As mentioned before, originally I wanted to analyze all of London’s data (if not all of UK), but it seems that to do within today I would need to parallelize a larger part of the analysis code (I have some plans, so maybe it will be a task for a post addendum). Instead, I will analyze the data falling into a radius of 4000 m from London’s centroid. For sf aficionados, this is a trivial task.

# Project to British National Grid (http://epsg.io/27700)
data <- data %>% st_transform(crs = 27700)
london_union <- london_union %>% st_transform(crs = 27700)

# Build a circle
center <- st_centroid(london_union)
max_radius <- 4000
london_circle <- st_buffer(center, max_radius)

# Filter data thanks to map
london_data <- data[london_circle,]

This is what we got. I know, I know, it’s a small sample!

# Original amount of data
nrow(data)
## [1] 1281716
# Filtered data
nrow(london_data)
## [1] 18069
# Draw the area
plot(london_union)
plot(london_circle, add = T, col = sf.colors(n = 1, alpha = 0.3))

# Display the accidents as points
mapview(london_data, map.types = "Esri.WorldImagery")

Spatial data

We’re now ready to inspect some spatial data. I will split the accidents in two categories. This distinction is highly arbitrary: in order to be able to use the final results (i.e. where shoud Sadiq Khan spend taxpayer’s money to increase security), more understanding of the (meta-)data is needed. But once again, even with a suboptimal decision at this stage, the method stays valid.

I decided to split the data into Severe and Non-Severe accidents, the former involving more than one casualty, the latter strictly one. To visualize and work on spatial densities, a special kind of object, point pattern dataset, coming from the spatstat package, will be used. The package is not (yet) able to deal with sf, so we’re going to hold our noses and go back for a moment to sp.

# Create spatstat ppp object piece by piece
# Define window
london_owin <- sf::as_Spatial(london_circle) %>% polyCub::as.owin.SpatialPolygons()
# Extract accident coordinates
london_coords <- sf::st_coordinates(london_data)
# Build "marks" or features
london_marks <- as.factor(ifelse(london_data$Number_of_Casualties == 1, 'Non-Severe', 'Severe'))
# Define ppp
london_ppp <- spatstat::ppp(x = london_coords[, 1], y = london_coords[, 2],
                  window = london_owin,
                  marks = london_marks)

With such an object, we can use spatstat to quickly display useful information. We start by showing the fraction of Severe accidents over the total.

# Split the data according to the "marks"
london_splits <- split(london_ppp)
# Compute densities
accident_densities <- stats::density(london_splits)
# Display fractional density
frac_severe_accidents <- accident_densities[[2]]/(accident_densities[[1]] + accident_densities[[2]])
plot(frac_severe_accidents)

This plot is cool, but it just tells us that the severe accidents are present in that area with a percentage that oscillates between approx. 13% and 19%. Are the highest concentrations meaningful? Are some areas actually more dangerous than others? Aren’t they simply the result of random fluctuations?

There is one way to find out. The technique I’m going to apply is called spatial segregation, in the context of point pattern analysis. I am going to display the density of accidents - split in two custom categories - thanks to kernel density. A smooth curve will be fitted on the data: highest values will correspond to the location of points, and these values will diminish as the distance from the point increases. However, a crucial parameter must first be determined: the bandwidth. Choose it too small, and the density will be at the maximum exactly where the points are, and zero elsewhere; choose it too large and the density will appear as a blob without any features.

spseg(opt = 1) from the spatialkernel package will assist us in providing the best bandwidth value given our densities. The execution of the workforce function, spatialkernel::cvloglk, is resource intensive, and we’re going to use parallel computation (as long as you’re on Mac or Linux). I leave out the details, but you can check this blog post for more information.

h <- seq(500,550,5)

os <- Sys.info()['sysname']

if(os == 'Windows') {
  bw_choice <- spatialkernel::spseg(pts = london_ppp,
                     h = h, opt = 1)
  
  spatialkernel::plotcv(bw_choice); abline(v = bw_choice$hcv, lty = 2, col = "red")
  
} else if(os == 'Linux' | os == 'Darwin') {
  
  no_cores <- max(1, parallel::detectCores() - 1)
  cl <- parallel::makeCluster(no_cores, type = 'FORK')
  cv <- parallel::parLapply(cl, h, function(h)
    cvloglk(pts = as.matrix(spatstat::coords(london_ppp)),
            h = h,
            marks = as.character(spatstat::marks(london_ppp)))$cv
  )
  
  parallel::stopCluster(cl)
  
  bw_choice <- data.frame(x = h, y = unlist(cv))
  plot(bw_choice, type = 'l')
  max_loglk <- which.max(bw_choice[,2])
  abline(v = bw_choice[max_loglk, 1], lty = 2, col = "red")
} else {
  stop('OS not supported.')
}

This plot indicates the value maximizing the cross-validate log-likelihood function of the bivariate Poisson point process, which describes our point distribution. This value will be carried over in the next step.

How can we assess if a higher concentration of Severe accidents is statistically significant. A classic way to proceed is: 1. to generate a large number of Monte-Carlo simulations that randomly change the location of the accidents, 2. to compute some test statistic for every one of them, 3. and then to check how our data’s test statistic compare to the simulated ones. This is done for all pixels on a grid covering our region of interest. The following step generates 1000 MC and took hours to compute. Its output has already been loaded for you behind the scenes.

seg1000 <- spseg(
  pts = london_ppp,
  h = bw_choice[max_loglk, 1],
  opt = 3,
  ntest = 1000,
  proc = FALSE)
spatialkernel::plotmc(seg1000, 'Severe')

This plot shows two levels of the p-values of the Severe accidents locations in our data: one for p-value equal to 0.05, and the other equal to 0.95. This p-value represents the statistical significance with which we reject (or not) the null hypothesis of a random distribution of Severe vs Non-Severe accidents. The lower is the p-value, the stricter is our rejection test. In other words, the areas with a p-value below 0.05 show a concentration of Severe accidents that doesn’t seem to be fortuitous.

Downloading the raster

For a decision maker, the plot above is not informative at all, because we don’t get to see to which zones the risky areas correspond. We can of course solve this by downloading a raster of the area of interest.

# The download needs lat-lon parameters
mybb <- sf::st_bbox(sf::st_transform(london_circle, crs=4326))
area <- extent(mybb['xmin'], mybb['xmax'], mybb['ymin'], mybb['ymax'])
r <- raster()
extent(r) <- area

# Download from Google Maps
gm  <- dismo::gmap(x = r, type = "roadmap", scale = 1, zoom = 13, rgb = TRUE)

# We project the raster back to crs = 27700
uk_proj4 <- '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'
# method = 'ngb' is necessary to preserve the RGB colors
gm2 <- projectRaster(gm, crs = uk_proj4, method = 'ngb')

As a final step we generate the complete picture. We choose a p-value threshold of < 0.01, meaning that we have one in a hundred chance of being wrong.

# Rearrange the probability column into a grid
ncol <- length(seg1000$gridx)
prob_severe <- list(x = seg1000$gridx,
                    y = seg1000$gridy,
                    z = matrix(seg1000$p[, "Severe"],
                               ncol = ncol))


# Rearrange the p-values, but choose a p-value threshold
p_value <- list(x = seg1000$gridx,
                y = seg1000$gridy,
                z = matrix(seg1000$stpvalue[, "Severe"] < 0.01,
                           ncol = ncol))
# Create a mapping function
segmap <- function(prob_list, pv_list, low, high){
  
  # background map
  # plotRGB(test2)
  plotRGB(gm2)
  
  
  # p-value areas
  image(pv_list, 
        col = c("#00000000", "#FF808080"), add = T) 
  
  # probability contours
  contour(prob_list,
          levels = c(low, high),
          col = c("#206020", "red"),
          labels = c("Low", "High"),
          add = TRUE)
  
  # boundary window
  plot(Window(london_ppp), add = TRUE)
}

# Map the probability and p-value
segmap(prob_severe, p_value, 0.13, 0.17)

The chosen probability values (0.13 and 0.17) are completely custom and depend on the fraction of event carried by the data. If you analyze a different area, these numbers will certainly be different if you want to keep a good visual effect.